# This cell creates a button that hides/unhides code cells so you can quickly view only the results.
# Works only in Jupyter Notebooks.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
# Description:
# Exercise10 notebook.
#
# Copyright (C) 2019 Antti Parviainen
# Based on the SSD PyTorch implementation by Max deGroot and Ellis Brown
#
# This software is distributed under the GNU General Public
# Licence (version 2 or later);
import os
import numpy as np
from matplotlib import pyplot as plt
import cv2
import torch
from torch.autograd import Variable
from ssd import build_ssd
from data import VOCDetection, VOCAnnotationTransform
from data import VOC_CLASSES as labels
import warnings
warnings.filterwarnings("ignore")
# Select data directory
if os.path.isdir('/coursedata'):
    course_data_dir = '/coursedata'
    docker = False
elif os.path.isdir('../../../coursedata'):
    docker = True
    course_data_dir = '../../../coursedata'
else:
    # Specify course_data_dir on your machine
    docker = True
    course_data_dir = '/home/jovyan/work/coursedata/'
print('The data directory is %s' % course_data_dir)
data_dir = os.path.join(course_data_dir, 'exercise-10-data')
print('Data stored in %s' % data_dir)
The data directory is /coursedata
Data stored in /coursedata/exercise-10-data
The problems should be solved before the exercise session and the solutions returned via MyCourses. Upload to MyCourses both this Jupyter Notebook (.ipynb) file containing your solutions and the exported PDF version of this Notebook. If there are both programming and pen & paper tasks, kindly combine the two PDF files (your scanned/LaTeX solutions and the exported Notebook) into a single PDF and submit it together with the Notebook (.ipynb) file.
Note that you should make sure that everything you need to implement works with the pictures specified in this exercise round.
Docker users: the pre-trained weights are provided in coursedata/exercise-10-data/weights.
The goal of this task is to learn the basics of deep-learning-based object detection with SSD by experimenting with the provided code and by reading the original Single Shot MultiBox Detector publication by Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg from 2016.
Read the research paper linked above and experiment with the provided sample code according to the instructions below. Then answer questions a), b), and c) below and return your answers. Note that scientific publications are written for domain experts, and some details may be challenging to understand if the necessary background information is missing. However, don't worry if you don't understand every detail: you should be able to grasp the overall idea and answer the questions even if some details remain difficult to understand.
The code implements the following steps to demonstrate SSD:
The authors' contributions in introducing SSD (Single Shot MultiBox Detector) can be summarized as follows:
Faster and More Accurate Detection: SSD is presented as a single-shot detector for multiple categories that surpasses the speed of previous state-of-the-art single-shot detectors like YOLO. Importantly, it achieves comparable accuracy to slower techniques that involve explicit region proposals and pooling, including Faster R-CNN.
Core Architecture: The fundamental aspect of SSD is its prediction mechanism, where it predicts category scores and box offsets for a fixed set of default bounding boxes. This is accomplished using small convolutional filters applied to feature maps.
Multi-Scale Predictions with Aspect Ratio Separation: SSD enhances detection accuracy by generating predictions at different scales from feature maps of varying scales. Additionally, the authors explicitly separate predictions based on aspect ratio, contributing to the model's ability to handle objects of diverse shapes.
End-to-End Training and High Accuracy: The design choices in SSD lead to a straightforward end-to-end training process. Despite working well on low-resolution input images, SSD maintains high accuracy, thus improving the trade-off between speed and accuracy.
Comprehensive Experiments: The authors conducted experiments involving timing and accuracy analyses on models with different input sizes. These evaluations were performed on widely recognized datasets, including PASCAL VOC, COCO, and ILSVRC. The results were compared against a range of recent state-of-the-art approaches in object detection.
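The multi-scale design described above uses a simple linear rule for the default-box scales. Here is a minimal sketch of that rule, assuming the formula and the s_min = 0.2, s_max = 0.9 values given in Section 2.2 of the paper (note that implementations often treat the first feature map's scale separately):

```python
# Default box scales per the formula in Section 2.2 of the SSD paper:
# s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), for k = 1..m,
# where m is the number of feature maps used for prediction.
def default_box_scales(m=6, s_min=0.2, s_max=0.9):
    return [round(s_min + (s_max - s_min) * (k - 1) / (m - 1), 2)
            for k in range(1, m + 1)]

print(default_box_scales())  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Each scale is a fraction of the input image size, so the earliest feature map detects the smallest objects and the last one the largest.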
The purpose of the base network in SSD is to provide a foundational architecture for high-quality image classification. It serves as the early layers of the network and is truncated before any classification layers. The base network is responsible for extracting hierarchical features from input images, which can then be used for subsequent object detection tasks. The authors use VGG16 as their truncated base network and they pre-train it using the ILSVRC CLS-LOC dataset.
The loss function in SSD combines two terms: a localization loss (loc) and a confidence loss (conf). The localization loss is a Smooth L1 loss between the predicted bounding-box parameters and the ground-truth box parameters, encouraging accurate object localization. The confidence loss is a softmax loss over the class confidences, encouraging correct object classification. Together, these terms form the overall objective that guides the network to minimize errors in both bounding-box prediction and object classification during training.
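As a rough illustration of this objective, the following numpy sketch implements Eq. 1 of the paper, L = (1/N)(L_conf + α·L_loc) with α = 1. The hard-negative mining used in the real implementation is omitted, and all inputs are assumed to be the N matched default boxes:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss used for the localization term."""
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a ** 2, a - 0.5)

def multibox_loss(conf_scores, class_ids, loc_pred, loc_gt, alpha=1.0):
    # conf_scores: (N, C) raw class scores; class_ids: (N,) ground-truth labels
    # loc_pred, loc_gt: (N, 4) predicted / target box offsets
    n = len(class_ids)
    # softmax confidence loss over the class confidences
    e = np.exp(conf_scores - conf_scores.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    l_conf = -np.log(probs[np.arange(n), class_ids]).sum()
    # Smooth L1 localization loss over the box offsets
    l_loc = smooth_l1(loc_pred - loc_gt).sum()
    # Eq. 1: average over the N matched default boxes
    return (l_conf + alpha * l_loc) / n
```

With perfect localization and a very confident correct class score, the loss approaches zero, as expected from the formula.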
net = build_ssd('test', 300, 21) # initialize SSD
net.load_weights(data_dir+'/weights/ssd300_mAP_77.43_v2.pth')
Loading weights into state dict... Finished!
images = []
for i in range(4):
    if docker:
        # if local storage, use a 300-image subset of the test set
        img_id = np.random.randint(1, high=300)
        image = cv2.imread(data_dir + '/voc_images/' + f"{img_id:06d}" + '.jpg', cv2.IMREAD_COLOR)
    else:
        # if JupyterLab, use the full test set
        # here we specify year (07 or 12) and dataset ('test', 'val', 'train')
        testset = VOCDetection(data_dir + '/VOCdevkit', [('2007', 'val')], None, VOCAnnotationTransform())
        img_id = np.random.randint(0, high=len(testset))
        image = testset.pull_image(img_id)
    rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    images.append(rgb_image)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,10))
ax = axes.ravel()
ax[0].imshow(images[0])
ax[0].axis('off')
ax[1].imshow(images[1])
ax[1].axis('off')
ax[2].imshow(images[2])
ax[2].axis('off')
ax[3].imshow(images[3])
ax[3].axis('off')
plt.tight_layout()
plt.suptitle("Randomly sampled images from the dataset", fontsize=20)
plt.subplots_adjust(top=0.95)
plt.show()
Using the torchvision package, we can apply multiple built-in transforms. At test time we resize our image to 300x300, subtract the dataset's mean RGB values, and swap the color channels for input to SSD300.
def preprocess_inputs(images):
    preprocessed_images = []
    for image in images:
        x = cv2.resize(image, (300, 300)).astype(np.float32)
        x -= (104.0, 117.0, 123.0)
        x = x.astype(np.float32)
        preprocessed_images.append(x)
    return preprocessed_images
preprocessed_images = preprocess_inputs(images)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,10))
ax = axes.ravel()
ax[0].imshow(preprocessed_images[0])
ax[0].axis('off')
ax[1].imshow(preprocessed_images[1])
ax[1].axis('off')
ax[2].imshow(preprocessed_images[2])
ax[2].axis('off')
ax[3].imshow(preprocessed_images[3])
ax[3].axis('off')
plt.tight_layout()
plt.suptitle("Preprocessed input images", fontsize=20)
plt.subplots_adjust(top=0.95)
plt.show()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
def run_network(images, nrows, ncols, figsize=(10, 10), threshold=0.6, title=True):
    if nrows * ncols != len(images):
        print("Subgrid dimensions don't match with the number of images.")
        return
    preprocessed_images = preprocess_inputs(images)
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
    if len(images) != 1:
        ax = axes.ravel()
    else:
        ax = [axes]
    for it, input_image in enumerate(images):
        # Process data for the network
        # swap color channels (RGB -> BGR)
        x = preprocessed_images[it][:, :, ::-1].copy()
        # change the order to (channels, height, width)
        x = torch.from_numpy(x).permute(2, 0, 1)
        # wrap the image in a Variable so it is recognized by PyTorch autograd
        xx = Variable(x.unsqueeze(0))
        if torch.cuda.is_available():
            xx = xx.cuda()
        # SSD forward pass
        y = net(xx)
        detections = y.data
        # colormap for the bounding boxes
        colors = plt.cm.hsv(np.linspace(0, 1, 21)).tolist()
        # scale each detection back up to the image
        scale = torch.Tensor(input_image.shape[1::-1]).repeat(2)
        for i in range(detections.size(1)):
            j = 0
            # guard against running past the end of the detection list
            while j < detections.size(2) and detections[0, i, j, 0] >= threshold:
                score = detections[0, i, j, 0]
                label_name = labels[i - 1]
                display_txt = '%s: %.2f' % (label_name, score)
                pt = (detections[0, i, j, 1:] * scale).cpu().numpy()
                coords = (pt[0], pt[1]), pt[2] - pt[0] + 1, pt[3] - pt[1] + 1
                color = colors[i]
                ax[it].add_patch(plt.Rectangle(*coords, fill=False, edgecolor=color, linewidth=2))
                ax[it].text(pt[0], pt[1], display_txt, bbox={'facecolor': color, 'alpha': 0.5})
                j += 1
        ax[it].imshow(images[it])
        ax[it].axis('off')
    plt.tight_layout()
    if title:
        plt.suptitle("Detection results at threshold: %1.2f" % threshold, fontsize=20)
        plt.subplots_adjust(top=0.95)
    plt.show()
# Filter outputs with confidence scores lower than a threshold. The default threshold is 60%.
run_network(images, nrows=2, ncols=2, figsize=(10,10), threshold=0.6)
To better detect challenging inputs, the authors implemented the data augmentation described in Sections 2.2 and 3.2 of the publication and in reference 14. Read those sections of the publication, skim the main points of reference 14, and answer the following questions.
The provided results suggest that the network performs better at detecting larger objects that are surrounded by fewer distracting elements.
Switching from an input size of 300x300 to 512x512 could have several implications for the performance of the object detection model. Generally, increasing the input resolution leads to better accuracy, especially in detecting smaller objects and capturing finer details. However, this improvement comes at the cost of increased computational complexity and slower processing. The authors report that with a 512x512 input SSD achieves 76.8% mAP, whereas with 300x300 it achieves only 74.3%, confirming the positive impact of resolution on accuracy.
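One way to see why input size matters: even at 300x300 the detector evaluates thousands of default boxes. Assuming the feature-map sizes and boxes-per-location of the paper's SSD300 configuration (conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2), the total can be computed directly:

```python
# Number of default boxes in SSD300, computed from the feature-map sizes and
# boxes-per-location of the paper's SSD300 configuration (assumed here).
feature_maps = [38, 19, 10, 5, 3, 1]
boxes_per_loc = [4, 6, 6, 6, 4, 4]
total = sum(f * f * b for f, b in zip(feature_maps, boxes_per_loc))
print(total)  # 8732
```

A larger input produces larger feature maps and hence even more default boxes, which explains part of the accuracy/speed trade-off discussed above.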
Designing an object detector that uses HD-resolution (1920x1080 or higher) images as inputs could likewise have negative implications for the performance of the model. While higher-resolution images can provide more detailed information for accurate object detection, they also significantly increase computational demands. Processing HD images requires more powerful hardware, and real-time applications may struggle to achieve the desired speed.
# Photo credit: burningman.org. Used under the fair use principles for transformative educational purposes.
image11 = cv2.imread(data_dir+'/test_images/img11.jpg', cv2.IMREAD_COLOR)
images1=[cv2.cvtColor(image11, cv2.COLOR_BGR2RGB)]
run_network(images1, nrows=1, ncols=1, figsize=(10,10), threshold=0.6)
The detector seems to struggle when two or more objects overlap, such as the cat and hat or the group of dogs. This issue is particularly evident when objects are close together or when they share common boundaries, making it challenging for the detector to distinguish and accurately localize individual objects within the overlaid regions.
Yes, the authors of the SSD publication have attempted to address this problem. They implemented a data augmentation strategy described in Section 2.2, which involves generating random crops of the images. These random crops act as a "zoom in" operation, providing larger training examples and aiding in the detection of small objects. Additionally, the authors introduced a "zoom out" operation, where the image is placed on a canvas of 16× the original image size before undergoing any random crop operation. This augmentation trick, referred to as "expansion," increases the number of small training examples and has shown consistent improvements of 2%-3% mAP across multiple datasets.
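A minimal sketch of that expansion ("zoom out") step, assuming a mean-filled canvas with 4x the original side length (i.e. 16x the area, as in the paper) and simplified randomness:

```python
import numpy as np

# "Zoom out" / expansion augmentation sketch: place the image at a random
# position on a canvas filled with the dataset mean values, so that objects
# become relatively smaller and the model sees more small training examples.
def expand(image, ratio=4, mean=(104, 117, 123)):
    h, w, c = image.shape
    canvas = np.empty((h * ratio, w * ratio, c), dtype=image.dtype)
    canvas[...] = mean  # fill with per-channel mean values
    top = np.random.randint(0, h * (ratio - 1) + 1)
    left = np.random.randint(0, w * (ratio - 1) + 1)
    canvas[top:top + h, left:left + w] = image
    return canvas
```

In the real pipeline the random-crop operation is then applied to the expanded canvas, and the ground-truth boxes are shifted accordingly (omitted here).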
# Photo credit: free wallpaper image. Used under the fair use principles for transformative educational purposes.
image21 = cv2.imread(data_dir+'/test_images/img21.jpg', cv2.IMREAD_COLOR)
# Photo credit: David Dodman, KNOM. Used under the fair use principles for transformative educational purposes.
image22 = cv2.imread(data_dir+'/test_images/img22.jpg', cv2.IMREAD_COLOR)
images2=[cv2.cvtColor(image21, cv2.COLOR_BGR2RGB),cv2.cvtColor(image22, cv2.COLOR_BGR2RGB)]
run_network(images2, nrows=2, ncols=1, figsize=(25,25), threshold=0.6)
Again, the detector appears to face challenges in detecting objects that overlap with each other in the test images.
Yes, the authors of the SSD publication have implemented data augmentation techniques to address the challenge of overlapping objects. During training, each image is randomly sampled using various options, including using the entire original input image, sampling patches with different minimum Jaccard overlaps (0.1, 0.3, 0.5, 0.7, or 0.9), and randomly sampling patches. The size of each sampled patch is a fraction of the original image size, and the aspect ratio is between 1/2 and 2. After sampling, each patch is resized to a fixed size and may be horizontally flipped with a probability of 0.5. These augmentation strategies aim to expose the model to diverse scenarios, including overlapping objects, making it more robust to such challenging situations.
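The minimum-overlap constraint mentioned above is measured with the Jaccard index (intersection over union). A small helper, assuming boxes in (x1, y1, x2, y2) corner form:

```python
# Jaccard overlap (IoU) between two boxes in (x1, y1, x2, y2) form, as used
# for the minimum-overlap thresholds {0.1, 0.3, 0.5, 0.7, 0.9} when sampling
# training patches.
def jaccard(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(jaccard((0, 0, 2, 2), (1, 1, 3, 3)))  # ≈ 0.143 (1 / 7)
```

The same measure is used when matching default boxes to ground-truth boxes during training.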
# Photo credit: Duncan Rawlinson - Duncan.co photo from flickr. Creative Commons license. https://creativecommons.org/licenses/by-nc/2.0/
image31 = cv2.imread(data_dir+'/test_images/img31.jpg', cv2.IMREAD_COLOR)
# Photo credit: Karri Huhtanen from flickr. Creative Commons license. https://creativecommons.org/licenses/by-nc/2.0/
image32 = cv2.imread(data_dir+'/test_images/img32.jpg', cv2.IMREAD_COLOR)
images3=[cv2.cvtColor(image31, cv2.COLOR_BGR2RGB),cv2.cvtColor(image32, cv2.COLOR_BGR2RGB)]
run_network(images3, nrows=2, ncols=1, figsize=(15,15), threshold=0.6)
In the first image (upper), the detector faces challenges in detecting objects likely due to their small size and close proximity to each other. These factors make it difficult for the model to accurately distinguish and localize individual objects in such a crowded and compact arrangement. In contrast, the second image (lower), which contains a zoomed-in section of the first image, allows for larger and more separated objects. This makes it comparatively easier for the detector to identify and detect the objects.
Yes, the authors have attempted to address the challenge of detecting small and closely spaced objects by implementing a data augmentation technique using "zoom in" and "zoom out" operations. The "zoom in" operation is achieved through random cropping, generating larger training examples to improve the model's performance, especially on small objects. Conversely, the "zoom out" operation involves randomly placing an image on a canvas of 16 times the original image size before cropping, creating more small training examples. This data augmentation strategy helps enhance the model's accuracy, particularly in scenarios with small and densely packed objects.
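A sketch of the "zoom in" patch-sampling constraints described above, assuming each side length is sampled as a fraction [0.1, 1] of the original (the paper states "size" without pinning this down) and leaving out the Jaccard-overlap check against ground-truth boxes:

```python
import random

# Random-crop ("zoom in") sampling sketch: patch sides a fraction [0.1, 1]
# of the original and aspect ratio constrained to [1/2, 2].
def sample_patch(width, height):
    for _ in range(50):  # retry until the aspect-ratio constraint is met
        w = random.uniform(0.1, 1.0) * width
        h = random.uniform(0.1, 1.0) * height
        if 0.5 <= w / h <= 2.0:
            left = random.uniform(0, width - w)
            top = random.uniform(0, height - h)
            return int(left), int(top), int(w), int(h)
    return 0, 0, width, height  # fall back to the whole image
```

In the real pipeline the sampled patch is then resized to the fixed 300x300 input and horizontally flipped with probability 0.5.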
# Photo credit: Brad Templeton. Used under the fair use principles for transformative educational purposes.
image41 = cv2.imread(data_dir+'/test_images/img41.jpg', cv2.IMREAD_COLOR)
image42 = cv2.imread(data_dir+'/test_images/img42.jpg', cv2.IMREAD_COLOR)
images4 = [cv2.cvtColor(image41, cv2.COLOR_BGR2RGB), cv2.cvtColor(image42, cv2.COLOR_BGR2RGB)]
# run the first image without title information to show it properly
run_network([images4[0]], nrows=1, ncols=1, figsize=(100,100), threshold=0.6, title=False)